MC-MinH: Metagenome Clustering using Minwise based Hashing

نویسندگان

  • Huzefa Rangwala
  • Zeehasham Rasheed
چکیده

Current bio-technologies allow sequencing of genomes from multiple organisms, that co-exist as communities within ecological environments. This collective genomic process (called metagenomics) has spurred the development of several computational tools for the quantification of abundance, diversity and role of different species within different communities. Unsupervised clustering algorithms (also called binning algorithms) have been developed to group similar metagenome sequences. We have developed an algorithm called MC-MinH that uses the min-wise hashing approach, along with a greedy clustering algorithm to group 16S and whole metagenomic sequences. We represent unequal length sequences using contiguous subsequences or k-mers, and then approximate the computation of pairwise similarity using independent min-wise hashing. The performance of our algorithm is evaluated on several real and simulated metagenomic benchmarks. We demonstrate that our approach is computationally efficient and produces accurate clustering results when evaluated using external ground truth.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Optimal Densification for Fast and Accurate Minwise Hashing

Minwise hashing is a fundamental and one of the most successful hashing algorithm in the literature. Recent advances based on the idea of densification (Shrivastava & Li, 2014a;c) have shown that it is possible to compute k minwise hashes, of a vector with d nonzeros, in mere (d + k) computations, a significant improvement over the classical O(dk). These advances have led to an algorithmic impr...

متن کامل

Approximately Minwise Independence with Twisted Tabulation

A random hash function h is ε-minwise if for any set S, |S| “ n, and element x P S, Prrhpxq “ minhpSqs “ p1 ̆ εq{n. Minwise hash functions with low bias ε have widespread applications within similarity estimation. Hashing from a universe rus, the twisted tabulation hashing of Pǎtraşcu and Thorup [SODA’13] makes c “ Op1q lookups in tables of size u1{c. Twisted tabulation was invented to get good ...

متن کامل

b-Bit Minwise Hashing in Practice: Large-Scale Batch and Online Learning and Using GPUs for Fast Preprocessing with Simple Hash Functions

ABSTRACT Minwise hashing is a standard technique in the context of search for approximating set similarities. The recent work [27] demonstrated a potential use of b-bit minwise hashing [26] for batch learning on large data. However, several critical issues must be tackled before one can apply b-bit minwise hashing to the volumes of data often used industrial applications, especially in the cont...

متن کامل

Training Logistic Regression and SVM on 200GB Data Using b-Bit Minwise Hashing and Comparisons with Vowpal Wabbit (VW)

Our recent work on large-scale learning using b-bit minwise hashing [21, 22] was tested on the webspam dataset (about 24 GB in LibSVM format), which may be way too small compared to real datasets used in industry. Since we could not access the proprietary dataset used in [31] for testing the Vowpal Wabbit (VW) hashing algorithm, in this paper we present an experimental study based on the expand...

متن کامل

b-Bit Minwise Hashing for Estimating Three-Way Similarities

Computing1 two-way and multi-way set similarities is a fundamental problem. This study focuses on estimating 3-way resemblance (Jaccard similarity) using b-bit minwise hashing. While traditional minwise hashing methods store each hashed value using 64 bits, b-bit minwise hashing only stores the lowest b bits (where b ≥ 2 for 3-way). The extension to 3-way similarity from the prior work on 2-way...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2013